General Information: In the following dataset, we have chemicals that are present in the wine and may affect its quality positively or negatively. Our goal is to find out which chemical has the most influence on the quality of the red wine.

Univariate Plots Section

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

The dataset contains 1599 wines. The highest quality score is 8/10 and the lowest is 3/10. Every wine contains some residual sugar and chlorides (salt), because the minimum of both variables is above zero. The pH ranges from 2.74 to 4.01, so all of the wines are acidic, which looks like a reasonable range for wine.

Calling table(rw) on the whole data frame fails because the resulting table is too large: ‘a table with >= 2^31 elements’.
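A likely reason for that error is that `table()` on a whole data frame cross-tabulates every column against every other column. A minimal sketch of the usual workaround, tabulating a single column, is below. The data frame here is a made-up toy stand-in for `rw`, not the real wine data.

```r
# Toy stand-in for the wine data frame `rw`; these six rows are made up.
rw <- data.frame(
  alcohol = c(9.4, 9.8, 10.5, 11.2, 12.0, 9.5),
  quality = c(5L, 5L, 6L, 7L, 6L, 5L)
)

# Tabulate one column instead of cross-tabulating every column at once.
quality_counts <- table(rw$quality)
print(quality_counts)

# Share of wines at each quality score.
quality_share <- prop.table(quality_counts)
print(round(quality_share, 2))
```

On the real data the same two calls would give the count and share of wines at each quality score.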

First, we want to see where most of our wines’ quality scores fall, so we can see how good the wines are in general. The following plot shows the number of red wines at each quality score. Well, it’s not that great: most are between 5 and 7, but we’ll see the reasons behind that :)


I was curious about the distributions of the chemical properties and which of them had the most effect on the quality of the wine, so I decided to plot them as histograms.
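As a rough sketch of how such histograms could be drawn with base R, the block below loops over the columns of a data frame and calls `hist()` on each. The data frame `rw_toy` is simulated for illustration (one roughly normal column, one right-skewed column) and is not the wine data; the original plots may well have been made with a different plotting library.

```r
set.seed(42)

# Simulated stand-in columns; the values are NOT from the wine dataset.
rw_toy <- data.frame(
  fixed.acidity  = rnorm(300, mean = 8.3, sd = 1.7),               # roughly normal
  residual.sugar = rlnorm(300, meanlog = log(2.2), sdlog = 0.4)    # right skewed
)

# Draw one histogram per column, side by side.
old_par <- par(mfrow = c(1, 2))
hists <- lapply(names(rw_toy), function(v) {
  hist(rw_toy[[v]], breaks = 30, main = v, xlab = v)
})
par(old_par)
names(hists) <- names(rw_toy)
```

Each element of `hists` keeps the bin counts, so the shapes can also be inspected numerically.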

Based on the above plot, we can see that the distribution is almost normal and most of our wines have a ‘fixed.acidity’ below 12. Maybe more is needed for better quality; we’ll see.


Again, the distribution is close to normal. Almost all wines have a ‘volatile.acidity’ below 1.2; we’ll see its effect on the quality later on.


Now let’s continue and plot the rest so we can see how they are distributed.

Wow, most of our plots so far are close to a normal distribution! We can also see that most wines contain <= 0.50 ‘citric.acid’.

This one is different: a right-skewed plot. More interesting, most wines have less than 4 ‘residual.sugar’; we’ll understand its effect on the quality later on.


Close to a normal distribution, but it looks like the wines have very low chlorides (median 0.079, with a long tail up to a maximum of 0.611). We’ll see its effect later on and understand it more :)

Low pH. Maybe that’s the reason behind the less-than-perfect quality; we’ll dig deeper later on.

Right skewed. We can see that most of our wines are at the lower end of the alcohol range; maybe this is the reason behind the quality ranking.

So we have seen a lot of right-skewed distributions. Maybe they are the reason no wine reaches a great quality score (9 or 10). But let us see.
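One quick, rough way to compare how right skewed the variables are is to look at how far the mean sits above the median. The sketch below uses the means and medians copied from the summary output at the top of this report; the `rel_gap` column is a name introduced here for illustration, and this gap is only a heuristic, not a formal skewness measure.

```r
# Means and medians copied from the summary() output of the wine data.
summary_stats <- data.frame(
  variable = c("fixed.acidity", "residual.sugar", "chlorides", "pH", "alcohol"),
  mean     = c(8.32, 2.539, 0.08747, 3.311, 10.42),
  median   = c(7.90, 2.200, 0.07900, 3.310, 10.20)
)

# Relative gap between mean and median; a larger positive gap hints at
# a longer right tail (heuristic only).
summary_stats$rel_gap <-
  (summary_stats$mean - summary_stats$median) / summary_stats$median

# Show the variables from most to least right skewed by this heuristic.
print(summary_stats[order(-summary_stats$rel_gap), ])
```

By this measure residual sugar and chlorides have the longest right tails, while pH is almost symmetric.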


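Now we’ll get the correlations between each variable and quality. Note that the final test in the output below compares alcohol with itself (rw$alcohol and rw$alcohol), so its correlation of 1 is trivial and says nothing about quality; the intended comparison was alcohol against quality.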

## 
##  Pearson's product-moment correlation
## 
## data:  rw$fixed.acidity and rw$quality
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07548957 0.17202667
## sample estimates:
##       cor 
## 0.1240516
## 
##  Pearson's product-moment correlation
## 
## data:  rw$volatile.acidity and rw$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578
## 
##  Pearson's product-moment correlation
## 
## data:  rw$citric.acid and rw$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725
## 
##  Pearson's product-moment correlation
## 
## data:  rw$residual.sugar and rw$quality
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03531327  0.06271056
## sample estimates:
##        cor 
## 0.01373164
## 
##  Pearson's product-moment correlation
## 
## data:  rw$chlorides and rw$quality
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.17681041 -0.08039344
## sample estimates:
##        cor 
## -0.1289066
## 
##  Pearson's product-moment correlation
## 
## data:  rw$alcohol and rw$alcohol
## t = 1896300000, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  1 1
## sample estimates:
## cor 
##   1
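A sketch of how these tests could be run in a loop, so that every predictor is always paired with `quality` and never with itself, is shown below. The data frame `rw_toy` and its coefficients are simulated assumptions for illustration only; the numbers it prints are not results for the wine data.

```r
set.seed(7)
n <- 500

# Simulated stand-in data; the dependence on quality is made up.
alcohol          <- rnorm(n, mean = 10.4, sd = 1.0)
volatile.acidity <- rnorm(n, mean = 0.53, sd = 0.18)
quality <- round(5.6 + 0.4 * (alcohol - 10.4) -
                 1.5 * (volatile.acidity - 0.53) + rnorm(n, sd = 0.6))
rw_toy <- data.frame(alcohol, volatile.acidity, quality)

# Every column except quality is treated as a predictor.
predictors <- setdiff(names(rw_toy), "quality")

# Run cor.test() once per predictor, always against quality.
results <- sapply(predictors, function(v) {
  ct <- cor.test(rw_toy[[v]], rw_toy$quality)
  c(cor = unname(ct$estimate), p.value = ct$p.value)
})

print(round(t(results), 4))
```

Because the second argument is fixed to the quality column, a copy-paste slip like correlating a variable with itself cannot happen.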

Univariate Analysis

What is the structure of your dataset?

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

1599 observations and 13 variables. ‘X’ is only a row index, so there are 11 chemical properties plus ‘quality’.

What is/are the main feature(s) of interest in your dataset?

Quality, which is scored between 0 and 10.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

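I expected volatile acidity, citric acid, residual sugar, and chlorides to be the best predictors, since all of them seem to relate to taste. The correlation tests above support this for volatile acidity (-0.39) and citric acid (0.23), less so for chlorides (-0.13), and not for residual sugar (0.01).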

Did you create any new variables from existing variables in the dataset?

No, there was no need to create new variables.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

The plots are noisy because of the limited scale and the large number of data points.

Bivariate Plots Section

Bivariate Analysis

We have already calculated the correlation between quality and the other variables. Now we want to take a closer look at them.
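As a sanity check on those correlation tests, the reported t statistics can be recomputed from the correlation coefficients with the standard relation t = r * sqrt(df) / sqrt(1 - r^2), where df = n - 2 = 1597. The sketch below uses the `cor` and `t` values copied from the cor.test output above; the helper name `t_from_r` is introduced here for illustration.

```r
# t statistic of a Pearson correlation r with df = n - 2 degrees of freedom.
t_from_r <- function(r, df) r * sqrt(df) / sqrt(1 - r^2)

# Correlations and t values copied from the cor.test() output in this report.
reported <- data.frame(
  variable   = c("fixed.acidity", "volatile.acidity", "citric.acid",
                 "residual.sugar", "chlorides"),
  cor        = c(0.1240516, -0.3905578, 0.2263725, 0.01373164, -0.1289066),
  t_reported = c(4.996, -16.954, 9.2875, 0.5488, -5.1948)
)

reported$t_recomputed <- t_from_r(reported$cor, df = 1597)
print(reported)
```

The recomputed values should agree with the reported ones up to rounding, which confirms the five quality tests are internally consistent.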

It looks like as volatile acidity increases, quality decreases, although there are two observations worth mentioning:

These findings agree with the information provided by the authors of the dataset: “too high levels can lead to an unpleasant, vinegar taste”.

The variables ‘quality’ and ‘citric.acid’ are positively correlated. However, wines with a quality score of seven or eight present very similar levels of citric acid. For the rest, the amount of citric acid is very dispersed, although the median citric acid quantity for low quality wines is very low.

It looks like the higher the alcohol content of a wine, the better the score it receives, but this effect only appears in wines with a quality of six or more; the rest have similar median values. There are a lot of outliers with a high percentage of alcohol among the wines of quality five.
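The medians that the box plots display can also be computed directly, which makes comparisons like the one above easier to check. A minimal sketch with `tapply()` follows; the nine rows in `rw_toy` are made up for illustration and are not taken from the wine data.

```r
# Made-up toy rows standing in for the wine data.
rw_toy <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.4, 9.8, 9.5, 10.0, 10.5, 11.0, 11.2, 11.8, 12.0)
)

# Median alcohol content within each quality score.
medians <- tapply(rw_toy$alcohol, rw_toy$quality, median)
print(medians)

# The matching box plot, one box per quality score.
boxplot(alcohol ~ quality, data = rw_toy,
        xlab = "quality", ylab = "alcohol (% by volume)")
```

On the real data, printing the medians next to the plot shows exactly where the alcohol level starts to rise with quality.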

The amount of sulphates is slightly positively correlated with the quality of the wine, but the effect is not as pronounced as with the other variables mentioned above. There are a lot of outliers.

There seems to be a mild negative correlation between ‘density’ and ‘quality’. I doubt the experts can detect such small variations in density between different wines, or even care about it. My guess is that this is due to ‘density’ being correlated with other influential variables, like ‘alcohol’, or just pure randomness.

This is similar to the last case. Can our sense of taste detect differences in pH of at most one unit? Maybe this is caused by the existing negative correlation between ‘pH’ and ‘citric.acid’.

Both ‘volatile.acidity’ and ‘pH’ are negatively correlated with ‘citric.acid’ (-0.552 and -0.542, respectively). The latter makes sense: low pH values indicate acidity.

The relation between acetic acid (volatile.acidity) and citric acid is not that clear.

High levels of alcohol are associated with low density (-0.496), which makes sense, since alcohol is less dense than water.
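Pairwise relationships like these can be read off a correlation matrix. The sketch below builds one with `cor()` and picks out the strongest off-diagonal pair. The data are simulated under a made-up assumption that density falls as alcohol rises, so the printed numbers are illustrative only and will not match the values quoted above.

```r
set.seed(3)
n <- 300

# Simulated stand-in data; the alcohol-density dependence is an assumption.
alcohol <- rnorm(n, mean = 10.4, sd = 1.1)
density <- 0.9967 - 0.001 * (alcohol - 10.4) + rnorm(n, sd = 0.0015)
pH      <- rnorm(n, mean = 3.31, sd = 0.15)
rw_toy  <- data.frame(alcohol, density, pH)

# Pearson correlation between every pair of columns.
cm <- cor(rw_toy)
print(round(cm, 3))

# Ignore the trivial self-correlations on the diagonal, then find the
# pair with the largest absolute correlation.
cm_off <- cm
diag(cm_off) <- 0
idx <- which(abs(cm_off) == max(abs(cm_off)), arr.ind = TRUE)[1, ]
strongest <- c(rownames(cm)[idx["row"]], colnames(cm)[idx["col"]])
print(strongest)
```

Zeroing the diagonal matters here: otherwise every variable's correlation of 1 with itself would always win.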


Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

I found it interesting that higher alcohol content had a higher probability of getting a good quality score. Also, sugar didn’t have much impact on the quality of the wine.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

I noticed that density and alcohol had a stronger negative correlation than others.

What was the strongest relationship you found?

pH and fixed acidity

Multivariate Plots Section

Plot for Quality by Volatile Acidity and Alcohol


I tried to make the colors distinct here and I still can’t see a clear pattern. Maybe citric acid and alcohol together can predict quality?

There is a little bit of a pattern where the dots get redder up and to the right, but it really doesn’t look like much of a pattern. At this point I think picking the two variables with the highest correlation coefficients might reveal something.


Alcohol by Chlorides for Differing Quality Red Wines


Alcohol Content by Wine Quality with multi boxplots

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The only relationship that I really saw was in that last plot. You can tell that as the alcohol increases and the volatile acidity decreases, the quality increases.
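One way such a pattern could be checked numerically is with a small linear model. No model was fitted in this project, so the block below is only a sketch on simulated data with made-up coefficients; it treats the quality score as a plain number, which is a simplification.

```r
set.seed(11)
n <- 500

# Simulated stand-in data; the coefficients used here are assumptions.
alcohol          <- rnorm(n, mean = 10.4, sd = 1.1)
volatile.acidity <- rnorm(n, mean = 0.53, sd = 0.18)
quality <- 5.6 + 0.35 * (alcohol - 10.4) -
           1.3 * (volatile.acidity - 0.53) + rnorm(n, sd = 0.65)

# Quality modelled as a linear function of both predictors.
fit <- lm(quality ~ alcohol + volatile.acidity)

# Estimates, standard errors, t values, and p-values.
print(coef(summary(fit)))
```

A positive alcohol coefficient together with a negative volatile acidity coefficient would match the trend seen in the plot.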

Were there any interesting or surprising interactions between features?

Nope

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

From the above box plots, we can see the typical alcohol content in each quality range: the high-quality wines contain more alcohol (% by volume).

Plot Two

Description Two

The above scatter plot describes the amount of chlorides and alcohol in every red wine, and also shows how they affect the quality of the wine. We can see that the highest-quality wines contain less chlorides and more alcohol.

Plot Three

Description Three

This chart shows how quality improves as the alcohol content increases and the volatile acidity decreases. There is an overall trend of the colors getting darker as they go to the bottom right.

Reflection

This dataset has 11 physicochemical properties of 1599 red wines.

For the univariate plots, I used a line plot to see the curve of the quality, which was very messy and hard to read, so I added geom_smooth to make it easier to observe. I also used cor.test to see the correlation between the physicochemical properties and quality.

For the bivariate plots, I used scatter plots to find the relationships between various variables. I also used a smoother with method ‘lm’ (linear model) to make them easier to read and understand.

For the multivariate plots, which were the hardest for me :( I also used jitter plots and mapped the color to different variables so I could read them more easily and make them more meaningful.

The struggle I faced was understanding wine and the different physicochemical properties involved. It was hard for me to choose which variables to graph because I found them hard to understand and I’m not a wine drinker :)

The only surprise I found was that sugar does not have a high impact on wine quality, because normally, when any food has a high amount of sugar, it’s hard to eat and it becomes tasteless, but in this dataset it’s different.

Everything went well in this project after fully understanding the variables.

I think this is a small dataset with a limited number of observations. In the future, if it had something like 50k+ observations, we would understand the different impacts on wine quality more fully.

AND THAT’S IT!!! THANK YOU VERY MUCH FOR THIS INTERESTING PROJECT <3